{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. _phone_userguide:\n",
    "\n",
    "Phone Numbers\n",
    "============="
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Introduction\n",
    "------------\n",
    "\n",
    "The function :func:`clean_phone() <dataprep.clean.clean_phone.clean_phone>` cleans and standardizes a DataFrame column containing phone numbers. The function :func:`validate_phone() <dataprep.clean.clean_phone.validate_phone>` validates either a single phone number or a column of phone numbers, returning True if the value is valid, and False otherwise."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Currently, Canadian/US phone numbers having the following format are supported as valid input:\n",
    "\n",
    "* Country code of \"1\" (optional)\n",
    "* Three-digit area code (optional)\n",
    "* Three-digit central office code\n",
    "* Four-digit station code\n",
    "* Extension number preceded by \"#\", \"x\", \"ext\", or \"extension\" (optional)\n",
    "\n",
    "A combination of numbers and uppercase letters is allowed within the central office code and the station code.\n",
    "\n",
    "Various delimiters between the digits are also allowed, such as spaces, hyphens, periods, brackets, and/or forward slashes.\n",
    "\n",
    "Phone numbers can be converted to the following formats via the `output_format` parameter:\n",
    "\n",
    "* North American Numbering Plan (nanp): NPA-NXX-XXXX\n",
    "* E.164 (e164): +1NPANXXXXXX\n",
    "* national: (NPA) NXX-XXXX\n",
    "\n",
    "Invalid parsing is handled with the `errors` parameter:\n",
    "\n",
    "* \"coerce\" (default): invalid parsing will be set to NaN\n",
    "* \"ignore\": invalid parsing will return the input\n",
    "* \"raise\": invalid parsing will raise an exception\n",
    "\n",
    "After cleaning, a **report** is printed that provides the following information:\n",
    "\n",
    "* How many values were cleaned (the value must have been transformed).\n",
    "* How many values could not be parsed.\n",
    "* A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.\n",
    "\n",
    "The following sections demonstrate the functionality of `clean_phone()` and `validate_phone()`. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### An example dataset containing phone numbers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "df = pd.DataFrame({\n",
    "    \"phone\": [\n",
    "        \"555-234-5678\", \"(555) 234-5678\", \"555.234.5678\", \"555/234/5678\",\n",
    "        15551234567, \"(1) 555-234-5678\", \"+1 (234) 567-8901 x. 1234\",\n",
    "        \"2345678901 extension 1234\", \"2345678\", \"800-299-JUNK\", \"1-866-4ZIPCAR\",\n",
    "        \"123 ABC COMPANY\", \"+66 91 889 8948\", \"hello\", np.nan, \"NULL\"\n",
    "    ]\n",
    "})\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Default `clean_phone()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, the `output_format` parameter is set to \"nanp\" (NPA-NXX-XXXX) and the `errors` parameter is set to \"coerce\" (set to NaN when parsing is invalid)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import clean_phone\n",
    "clean_phone(df, \"phone\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that \"555-234-5678\" is considered not cleaned in the report since its resulting format is the same as the input. Also, \"+66 91 889 8948\" is invalid because it is not a Canadian or US phone number.\n",
    "\n",
    "The letters in \"800-299-JUNK\" and \"1-866-4ZIPCAR\" are automatically converted to their number equivalents on a telephone keypad."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Output formats\n",
    "\n",
    "This section demonstrates the supported phone number formats.\n",
    "\n",
    "### E.164 (e164)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_phone(df, \"phone\", output_format=\"e164\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the country code \"+1\" is not added to \"2345678\" as this would result in an invalid Canadian or US phone number."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### national"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_phone(df, \"phone\", output_format=\"national\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. `split` parameter\n",
    "\n",
    "The `split` parameter adds individual columns containing the cleaned phone number values to the given DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_phone(df, \"phone\", split=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. `fix_missing` parameter\n",
    "\n",
    "By default, the `fix_missing` parameter is set to \"empty\" (leave the missing country code as is). If set to \"auto\", the country code is set to \"1\".\n",
    "\n",
    "### `split` and `fix_missing`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_phone(df, \"phone\", split=True, fix_missing=\"auto\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, note that the country code is not set to \"1\" for \"2345678\" as this would result in an invalid Canadian or US phone number."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. `inplace` parameter\n",
    "\n",
    "This deletes the given column from the returned DataFrame. \n",
    "A new column containing cleaned phone numbers is added with a title in the format `\"{original title}_clean\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_phone(df, \"phone\", inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `inplace` and `split`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_phone(df, \"phone\", split=True, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `inplace`, `split` and `fix_missing`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_phone(df, \"phone\", split=True, inplace=True, fix_missing=\"auto\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. `validate_phone()` \n",
    "\n",
    "`validate_phone()` returns True when the input is a valid phone number. Otherwise it returns False.\n",
    "Valid types are the same as `clean_phone()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import validate_phone\n",
    "print(validate_phone(1234))\n",
    "print(validate_phone(2346789))\n",
    "print(validate_phone(\"1 800 234 6789\"))\n",
    "print(validate_phone(\"+44 7700 900077\"))\n",
    "print(validate_phone(\"555-234-6789 ext 32\"))\n",
    "print(validate_phone(\"1-866-4ZIPCAR\"))\n",
    "print(validate_phone(\"123 ABC COMPANY\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\n",
    "    \"phone\": [\n",
    "        \"555-234-5678\", \"(555) 234-5678\", \"555.234.5678\", \"555/234/5678\",\n",
    "        15551234567, \"(1) 555-234-5678\", \"+1 (234) 567-8901 x. 1234\",\n",
    "        \"2345678901 extension 1234\", \"2345678\", \"800-299-JUNK\", \"1-866-4ZIPCAR\",\n",
    "        \"123 ABC COMPANY\", \"+66 91 889 8948\", \"hello\", np.nan, \"NULL\"\n",
    "    ]\n",
    "})\n",
    "df[\"valid\"] = validate_phone(df[\"phone\"])\n",
    "df"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
